For Markov random fields on $\mathbb{Z}^d$ with finite state space, we address the statistical estimation of the basic neighborhood, the smallest region that determines the conditional distribution at a site on the condition that the values at all other sites are given. A modification of the Bayesian Information Criterion, replacing likelihood by pseudo-likelihood, is proved to provide strongly consistent estimation from observing a realization of the field on increasing finite regions: the estimated basic neighborhood equals the true one eventually almost surely, not assuming any prior bound on the size of the latter. Stationarity of the Markov field is not required, and phase transition does not affect the results.

1. Introduction. In this paper Markov random fields on the lattice $\mathbb{Z}^d$
with finite state space are considered, adopting the usual assumption that 
the finite-dimensional distributions are strictly positive. Equivalently, these 
are Gibbs fields with finite range interaction; see [13]. They are essential in 
statistical physics, for modeling interactive particle systems [10], and also in 
several other fields [3], for example, in image processing [2]. 

One statistical problem for Markov random fields is parameter estima- 
tion when the interaction structure is known. By this we mean knowledge of 
the basic neighborhood, the minimal lattice region that determines the condi- 
tional distribution at a site on the condition that the values at all other sites 
are given; formal definitions are in Section 2. The conditional probabilities 
involved, assumed translation invariant, are parameters of the model. Note
that they need not uniquely determine the joint distribution on $A^{\mathbb{Z}^d}$, a phenomenon
known as phase transition. Another statistical problem is model
selection, that is, the statistical estimation of the interaction structure (the 
basic neighborhood). This paper is primarily devoted to the latter. 

Parameter estimation for Markov random fields with a known interac- 
tion structure was considered by, among others, Pickard [19], Gidas [14, 15], 
Geman and Graffigne [12] and Comets [6]. Typically, parameter estimation 
does not directly address the conditional probabilities mentioned above, but 
rather the potential. This admits parsimonious representation of the condi- 
tional probabilities that are not free parameters, but have to satisfy alge- 
braic conditions that need not concern us here. For our purposes, however, 
potentials will not be needed. 

We are not aware of papers addressing model selection in the context of 
Markov random fields. In other contexts, penalized likelihood methods are 
popular; see [1, 21]. The Bayesian Information Criterion (BIC) of Schwarz 
[21] has been proven to lead to consistent estimation of the "order of the 
model" in various cases, such as i.i.d. processes with distributions from ex- 
ponential families [17], autoregressive processes [16] and Markov chains [11]. 
These proofs include the assumption that the number of candidate model 
classes is finite; for Markov chains this means that there is a known upper 
bound on the order of the process. The consistency of the BIC estimator of 
the order of a Markov chain without such prior bound was proved by Csiszár
and Shields [8]; further related results appear in [7]. A related recent result, 
for processes with variable memory length [5, 22], is the consistency of the 
BIC estimator of the context tree, without any prior bound on memory 
depth [9]. 

For Markov random fields, penalized likelihood estimators like BIC run 
into the problem that the likelihood function cannot be calculated explicitly. 
In addition, no simple formula is available for the "number of free param- 
eters" typically used in the penalty term. To overcome these problems, we 
will replace likelihood by pseudo-likelihood, first introduced by Besag [4], 
and modify also the penalty term; this will lead us to an analogue of BIC 
called the Pseudo-Bayesian Information Criterion or PIC. Our main result
is that if one minimizes this criterion for a family of hypothetical basic neigh- 
borhoods that grows with the sample size at a specified rate, the resulting 
PIC estimate of the basic neighborhood equals the true one eventually al- 
most surely. In particular, the consistency theorem does not require a prior 
upper bound on the size of the basic neighborhood. It should be empha- 
sized that the underlying Markov field need not be stationary (translation 
invariant), and phase transition causes no difficulty. 

An auxiliary result perhaps of independent interest is a typicality propo- 
sition on the uniform closeness of empirical conditional probabilities to the 
true ones, for conditioning regions whose size may grow with the sample
size. Though this result is weaker than analogous ones for Markov chains in
[7], it will be sufficient for our purposes.

The structure of the paper is the following. In Section 2 we introduce 
the basic notation and definitions, and formulate the main result. Its proof 
is provided by the propositions in Sections 4 and 5. Section 3 contains the 
statement and proof of the typicality proposition. Section 4 excludes over- 
estimation, that is, the possibility that the estimated basic neighborhood 
properly contains the true one, using the typicality proposition. Section 5 
excludes underestimation, that is, the possibility that the estimated basic 
neighborhood does not contain the true one, via an entropy argument and a 
modification of the typicality result. Section 6 is a discussion of the results. 
The Appendix contains some technical lemmas. 

2. Notation and statement of the main results. We consider the $d$-dimensional lattice $\mathbb{Z}^d$. The points $i\in\mathbb{Z}^d$ are called sites, and $\|i\|$ denotes the maximum norm of $i$, that is, the maximum of the absolute values of the coordinates of $i$. The cardinality of a finite set $\Delta$ is denoted by $|\Delta|$. The notations $\subseteq$ and $\subset$ of inclusion and strict inclusion are distinguished in this paper.

A random field is a family of random variables indexed by the sites of the lattice, $\{X(i): i\in\mathbb{Z}^d\}$, where each $X(i)$ is a random variable with values in a finite set $A$. For $\Delta\subseteq\mathbb{Z}^d$, a region of the lattice, we write $X(\Delta)=\{X(i): i\in\Delta\}$. For the realizations of $X(\Delta)$ we use the notation $a(\Delta)=\{a(i)\in A: i\in\Delta\}$. When $\Delta$ is finite, the $|\Delta|$-tuples $a(\Delta)\in A^{\Delta}$ will be referred to as blocks.

The joint distribution of the random variables $X(i)$ is denoted by $Q$. We assume that its finite-dimensional marginals are strictly positive, that is,

$$Q(a(\Delta)) = \mathrm{Prob}\{X(\Delta)=a(\Delta)\} > 0 \quad\text{for } \Delta\subset\mathbb{Z}^d \text{ finite},\ a(\Delta)\in A^{\Delta}.$$

This standard assumption admits unambiguous definition of the conditional probabilities

$$Q(a(\Delta)\mid a(\Phi)) = \mathrm{Prob}\{X(\Delta)=a(\Delta)\mid X(\Phi)=a(\Phi)\}$$

for all disjoint finite regions $\Delta$ and $\Phi$.

By a neighborhood $\Gamma$ (of the origin $0$) we mean a finite, central-symmetric set of sites with $0\notin\Gamma$. Its radius is $r(\Gamma)=\max_{i\in\Gamma}\|i\|$. For any $\Delta\subseteq\mathbb{Z}^d$, its translate when $0$ is translated to $i$ is denoted by $\Delta^i$. The translate $\Gamma^i$ of a neighborhood $\Gamma$ (of the origin) will be called the $\Gamma$-neighborhood of the site $i$; see Figure 1.
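
The definition of a neighborhood can be illustrated by a small sketch. It assumes that sites are represented as integer tuples; the helper names `max_norm`, `is_neighborhood`, `radius` and `translate` are illustrative only and do not come from the paper, and assigning radius 0 to the empty neighborhood is a convention adopted here.

```python
def max_norm(site):
    """Maximum norm ||i||: the largest absolute coordinate of a site."""
    return max(abs(c) for c in site)


def is_neighborhood(sites):
    """Check central symmetry and that the origin is excluded.

    Finiteness is implicit because `sites` is a finite Python collection.
    """
    sites = set(sites)
    excludes_origin = all(any(c != 0 for c in s) for s in sites)
    symmetric = all(tuple(-c for c in s) in sites for s in sites)
    return excludes_origin and symmetric


def radius(sites):
    """r(Gamma): largest max-norm among the sites (0 for the empty set, by assumption)."""
    return max((max_norm(s) for s in sites), default=0)


def translate(region, i):
    """Translate of a region when the origin 0 is moved to site i."""
    return {tuple(a + b for a, b in zip(s, i)) for s in region}


# Hypothetical example: the four nearest neighbors of the origin in Z^2.
four_nn = {(0, 1), (0, -1), (1, 0), (-1, 0)}
```

Here `translate(four_nn, i)` plays the role of the neighborhood of the site `i` obtained by translating `four_nn`.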

A Markov random field is a random field as above such that there exists a neighborhood $\Gamma$, called a Markov neighborhood, satisfying for every $i\in\mathbb{Z}^d$

(2.1) $$Q(a(i)\mid a(\Delta^i)) = Q(a(i)\mid a(\Gamma^i)) \quad\text{if } \Delta\supseteq\Gamma,\ 0\notin\Delta,$$

where the last conditional probability is translation invariant.

This concept is equivalent to that of a Gibbs field with a finite range interaction; see [13]. Motivated by this fact, the matrix

$$Q_\Gamma = \{Q_\Gamma(a\mid a(\Gamma)): a\in A,\ a(\Gamma)\in A^{\Gamma}\}$$

specifying the (positive, translation-invariant) conditional probabilities in (2.1) will be called the one-point specification. All distributions on $A^{\mathbb{Z}^d}$ that satisfy (2.1) with a given conditional probability matrix $Q_\Gamma$ are called Gibbs distributions with one-point specification $Q_\Gamma$. The distribution $Q$ of the given Markov random field is one of these; $Q$ is not necessarily translation invariant.

The following lemma summarizes some well-known facts; their formal 
derivation from results in [13] is indicated in the Appendix. 

Lemma 2.1. For a Markov random field on the lattice as above, there exists a neighborhood $\Gamma_0$ such that the Markov neighborhoods are exactly those that contain $\Gamma_0$. Moreover, the global Markov property

$$Q(a(\Delta)\mid a(\mathbb{Z}^d\setminus\Delta)) = Q\Bigl(a(\Delta)\Bigm| a\Bigl(\bigcup_{i\in\Delta}\Gamma_0^i\setminus\Delta\Bigr)\Bigr)$$

holds for each finite region $\Delta\subset\mathbb{Z}^d$. These conditional probabilities are translation invariant and uniquely determined by the one-point specification $Q_{\Gamma_0}$.

The smallest Markov neighborhood $\Gamma_0$ of Lemma 2.1 will be called the basic neighborhood. The minimal element of the corresponding one-point specification matrix $Q_{\Gamma_0}$ is denoted by $q_{\min}$:

$$q_{\min} = \min_{a\in A,\ a(\Gamma_0)\in A^{\Gamma_0}} Q_{\Gamma_0}(a\mid a(\Gamma_0)) > 0.$$

In this paper we are concerned with the statistical estimation of the basic neighborhood $\Gamma_0$ from observation of a realization of the Markov random field on an increasing sequence of finite regions $\Lambda_n\subset\mathbb{Z}^d$, $n\in\mathbb{N}$; thus the $n$th sample is $x(\Lambda_n)$.

We will draw the statistical inference about a possible basic neighborhood $\Gamma$ based on the blocks $a(\Gamma)\in A^{\Gamma}$ appearing in the sample $x(\Lambda_n)$. For technical reasons, we will consider only such blocks whose center is in a subregion $\bar\Lambda_n$ of $\Lambda_n$, consisting of those sites $i\in\Lambda_n$ for which the ball with center $i$ and radius $\log^{1/(2d)}|\Lambda_n|$ also belongs to $\Lambda_n$:

$$\bar\Lambda_n = \{i\in\Lambda_n : \{j\in\mathbb{Z}^d : \|i-j\|\le\log^{1/(2d)}|\Lambda_n|\}\subseteq\Lambda_n\};$$

see Figure 1. Our only assumptions about the sample regions $\Lambda_n$ will be that

$$\Lambda_1\subset\Lambda_2\subset\cdots; \qquad |\bar\Lambda_n|/|\Lambda_n|\to 1.$$
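
The construction of the subregion of admissible centers can be sketched as follows. The sketch assumes a finite, nonempty sample region given as a set of integer tuples; the function name `interior` and the square test region are illustrative assumptions, not part of the paper.

```python
import math
from itertools import product


def interior(region, d):
    """Sites whose max-norm ball of radius log^{1/(2d)}|region| lies inside region."""
    region = set(region)
    rho = math.log(len(region)) ** (1.0 / (2 * d))
    # Lattice sites within max-norm distance rho have offsets in [-reach, reach].
    reach = math.floor(rho)
    offsets = list(product(range(-reach, reach + 1), repeat=d))
    return {
        i for i in region
        if all(tuple(a + b for a, b in zip(i, o)) in region for o in offsets)
    }


# Hypothetical sample region: a 10 x 10 square in Z^2.
square = set(product(range(10), repeat=2))
inner = interior(square, d=2)
```

For this square the radius lies between 1 and 2, so only the outermost layer of sites is discarded.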

For each block $a(\Gamma)\in A^{\Gamma}$, let $N_n(a(\Gamma))$ denote the number of occurrences of the block $a(\Gamma)$ in the sample $x(\Lambda_n)$ with the center in $\bar\Lambda_n$,

$$N_n(a(\Gamma)) = |\{i\in\bar\Lambda_n : \Gamma^i\subseteq\Lambda_n,\ x(\Gamma^i)=a(\Gamma)\}|.$$

The blocks corresponding to $\Gamma$-neighborhoods completed with their centers will be denoted briefly by $a(\Gamma,0)$. Similarly as above, for each $a(\Gamma,0)\in A^{\Gamma\cup\{0\}}$ we write

$$N_n(a(\Gamma,0)) = |\{i\in\bar\Lambda_n : \Gamma^i\subseteq\Lambda_n,\ x(\Gamma^i\cup\{i\})=a(\Gamma,0)\}|.$$

The notation $a(\Gamma,0)\in x(\Lambda_n)$ will mean that $N_n(a(\Gamma,0))\ge 1$.

The restriction $\Gamma^i\subseteq\Lambda_n$ in the above definitions is automatically satisfied if $r(\Gamma)\le\log^{1/(2d)}|\Lambda_n|$. Hence the same number of blocks is taken into account for all neighborhoods, except for very large ones:

$$\sum_{a(\Gamma)\in A^{\Gamma}} N_n(a(\Gamma)) = |\bar\Lambda_n| \quad\text{if } r(\Gamma)\le\log^{1/(2d)}|\Lambda_n|.$$

For Markov random fields the likelihood function cannot be explicitly 
determined. We shall use instead the pseudo-likelihood defined below. 

Given the sample $x(\Lambda_n)$, the pseudo-likelihood function associated with a neighborhood $\Gamma$ is the following function of a matrix $Q'_\Gamma$ regarded as the one-point specification of a hypothetical Markov random field for which $\Gamma$ is a Markov neighborhood:

(2.2) $$\mathrm{PL}_\Gamma(x(\Lambda_n),Q'_\Gamma) = \prod_{i\in\bar\Lambda_n} Q'_\Gamma(x(i)\mid x(\Gamma^i)) = \prod_{a(\Gamma,0)\in x(\Lambda_n)} Q'_\Gamma(a(0)\mid a(\Gamma))^{N_n(a(\Gamma,0))}.$$






I. CSISZAR AND ZS. TALATA 



We note that not all matrices $Q'_\Gamma$ satisfying

$$\sum_{a(0)\in A} Q'_\Gamma(a(0)\mid a(\Gamma)) = 1, \qquad a(\Gamma)\in A^{\Gamma},$$

are possible one-point specifications; the elements of a one-point specification matrix have to satisfy several algebraic relations not shown here. Still, we define the pseudo-likelihood also for $Q'_\Gamma$ not satisfying those relations, even admitting some elements of $Q'_\Gamma$ to be 0.

The maximum of this pseudo-likelihood is attained for $Q'_\Gamma(a(0)\mid a(\Gamma)) = N_n(a(\Gamma,0))/N_n(a(\Gamma))$. Thus, given the sample $x(\Lambda_n)$, the logarithm of the maximum pseudo-likelihood for the neighborhood $\Gamma$ is

(2.3) $$\log\mathrm{MPL}_\Gamma(x(\Lambda_n)) = \sum_{a(\Gamma,0)\in x(\Lambda_n)} N_n(a(\Gamma,0))\log\frac{N_n(a(\Gamma,0))}{N_n(a(\Gamma))}.$$

Now we are able to formalize a criterion in analogy to the Bayesian In- 
formation Criterion that can be calculated from the sample. 



Definition 2.1. Given a sample $x(\Lambda_n)$, the Pseudo-Bayesian Information Criterion, in short PIC, for the neighborhood $\Gamma$ is

$$\mathrm{PIC}_\Gamma(x(\Lambda_n)) = -\log\mathrm{MPL}_\Gamma(x(\Lambda_n)) + |A|^{|\Gamma|}\log|\Lambda_n|.$$

Remark. In our penalty term, the number $|A|^{|\Gamma|}$ of possible blocks $a(\Gamma)\in A^{\Gamma}$ replaces "half the number of free parameters" appearing in BIC, for which number no simple formula is available. Note that our results remain valid, with the same proofs, if the above penalty term is multiplied by any $c>0$.
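
The block counts, the maximum pseudo-likelihood (2.3) and the criterion of Definition 2.1 can be computed directly from a sample. The following is a minimal sketch, assuming the sample is stored as a dictionary from sites to symbols and that the admissible centers are passed in explicitly; the function names and the checkerboard toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def block_counts(sample, centers, gamma):
    """Return the counts of blocks with center, and of blocks without center.

    sample  : dict mapping each site of the sample region to its symbol
    centers : admissible centers (assumed to be sites of the sample region)
    gamma   : offsets forming the candidate neighborhood
    """
    offsets = sorted(gamma)
    joint, context = Counter(), Counter()
    for i in centers:
        sites = [tuple(a + b for a, b in zip(i, g)) for g in offsets]
        if all(s in sample for s in sites):  # translate must lie in the sample region
            block = tuple(sample[s] for s in sites)
            context[block] += 1
            joint[(block, sample[i])] += 1
    return joint, context


def log_mpl(sample, centers, gamma):
    """Logarithm of the maximum pseudo-likelihood, formula (2.3)."""
    joint, context = block_counts(sample, centers, gamma)
    return sum(n * math.log(n / context[block]) for (block, _), n in joint.items())


def pic(sample, centers, gamma, alphabet_size):
    """PIC: minus log MPL plus the penalty |A|^|Gamma| * log(size of sample region)."""
    penalty = alphabet_size ** len(set(gamma)) * math.log(len(sample))
    return -log_mpl(sample, centers, gamma) + penalty


# Hypothetical toy data: a checkerboard on a 6 x 6 square, centers in the inner 4 x 4 square.
sample = {(r, c): (r + c) % 2 for r, c in product(range(6), repeat=2)}
centers = list(product(range(1, 5), repeat=2))
four_nn = [(0, 1), (0, -1), (1, 0), (-1, 0)]
```

On the checkerboard the four nearest neighbors determine the center exactly, so `log_mpl(sample, centers, four_nn)` vanishes and the criterion reduces to the penalty term.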



The PIC estimator of the basic neighborhood $\Gamma_0$ is defined as that hypothetical $\Gamma$ for which the value of the criterion is minimal. An important feature of our estimator is that the family of hypothetical $\Gamma$'s is allowed to extend as $n\to\infty$, and thus no a priori upper bound for the size of the unknown $\Gamma_0$ is needed. Our main result says the PIC estimator is strongly consistent if the hypothetical $\Gamma$'s are those with $r(\Gamma)\le r_n$, where $r_n$ grows sufficiently slowly.

We mean by strong consistency that the estimated basic neighborhood equals $\Gamma_0$ eventually almost surely as $n\to\infty$. Here and in the sequel, "eventually almost surely" means that with probability 1 there exists a threshold $n_0$ [depending on the infinite realization $x(\mathbb{Z}^d)$] such that the claim holds for all $n\ge n_0$.

Theorem 2.1. The PIC estimator

$$\hat\Gamma_{\mathrm{PIC}}(x(\Lambda_n)) = \arg\min_{\Gamma:\, r(\Gamma)\le r_n} \mathrm{PIC}_\Gamma(x(\Lambda_n)),$$

with

$$r_n = o(\log^{1/(2d)}|\bar\Lambda_n|),$$

satisfies

$$\hat\Gamma_{\mathrm{PIC}}(x(\Lambda_n)) = \Gamma_0$$

eventually almost surely as $n\to\infty$.

Proof. Theorem 2.1 follows from Propositions 4.1 and 5.1 below. □ 
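
For small radii the minimization in Theorem 2.1 can be carried out by exhaustive search over all central-symmetric candidate sets. The sketch below is one possible way to do this, not an algorithm given in the paper; the function names, the tie-breaking rule (first candidate found) and the toy samples are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations, product


def pic(sample, centers, gamma, alphabet_size):
    """PIC of a neighborhood: minus log MPL plus |A|^|Gamma| * log(sample size)."""
    offsets = sorted(gamma)
    joint, context = Counter(), Counter()
    for i in centers:
        sites = [tuple(a + b for a, b in zip(i, g)) for g in offsets]
        if all(s in sample for s in sites):
            block = tuple(sample[s] for s in sites)
            context[block] += 1
            joint[(block, sample[i])] += 1
    log_mpl = sum(n * math.log(n / context[b]) for (b, _), n in joint.items())
    return -log_mpl + alphabet_size ** len(offsets) * math.log(len(sample))


def candidate_neighborhoods(r, d):
    """All central-symmetric subsets of the nonzero sites with max norm at most r."""
    zero = (0,) * d
    # One representative per symmetric pair {s, -s}: the lexicographically positive one.
    half = [s for s in product(range(-r, r + 1), repeat=d) if s > zero]
    for size in range(len(half) + 1):
        for chosen in combinations(half, size):
            yield frozenset(chosen) | frozenset(tuple(-c for c in s) for s in chosen)


def pic_estimate(sample, centers, r, d, alphabet_size):
    """Minimizer of PIC over candidates of radius at most r (ties: first candidate found)."""
    return min(candidate_neighborhoods(r, d),
               key=lambda gamma: pic(sample, centers, gamma, alphabet_size))


# Hypothetical toy data on a 12 x 12 square with centers in the inner 10 x 10 square.
grid = list(product(range(12), repeat=2))
centers = list(product(range(1, 11), repeat=2))
constant = {site: 0 for site in grid}
checkerboard = {(r, c): (r + c) % 2 for r, c in grid}
```

The number of candidates is $2^{((2r+1)^d-1)/2}$, so this brute-force search is only practical for very small $r$; it is meant to make the definition concrete rather than to be efficient.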

Remark. Actually, the assertion will be proved for $r_n$ equal to a constant times $\log^{1/(2d)}|\bar\Lambda_n|$. However, as this constant depends on the unknown distribution $Q$, the consistency can be guaranteed only when

$$r_n = o(\log^{1/(2d)}|\bar\Lambda_n|) = o(\log^{1/(2d)}|\Lambda_n|).$$

It remains open whether consistency holds when the hypothetical neigh- 
borhoods are allowed to grow faster, or even without any condition on the 
hypothetical neighborhoods. 

As a consequence of the above, we are able to construct a strongly consistent estimator of the one-point specification $Q_{\Gamma_0}$.

Corollary 2.1. The empirical estimator of the one-point specification,

$$\hat Q_{\hat\Gamma}(a(0)\mid a(\hat\Gamma)) = \frac{N_n(a(\hat\Gamma,0))}{N_n(a(\hat\Gamma))}, \qquad a(0)\in A,\ a(\hat\Gamma)\in A^{\hat\Gamma},$$

converges to the true $Q_{\Gamma_0}$ almost surely as $n\to\infty$, where $\hat\Gamma$ is the PIC estimator $\hat\Gamma_{\mathrm{PIC}}$.

Proof. Immediate from Theorem 2.1 and Proposition 3.1 below. □ 
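
The empirical estimator of Corollary 2.1 is a table of count ratios. A minimal sketch, assuming the neighborhood has already been estimated and the sample is a dictionary from sites to symbols (the function name and the striped toy sample are hypothetical):

```python
from collections import Counter
from itertools import product


def empirical_specification(sample, centers, gamma):
    """Ratio of block-with-center count to block count, for every observed block."""
    offsets = sorted(gamma)
    joint, context = Counter(), Counter()
    for i in centers:
        sites = [tuple(a + b for a, b in zip(i, g)) for g in offsets]
        if all(s in sample for s in sites):
            block = tuple(sample[s] for s in sites)
            context[block] += 1
            joint[(block, sample[i])] += 1
    return {(block, a): n / context[block] for (block, a), n in joint.items()}


# Hypothetical toy data: vertical stripes with column pattern 0, 0, 1, 1 on an 8 x 8 square.
sample = {(r, c): (c // 2) % 2 for r, c in product(range(8), repeat=2)}
centers = list(product(range(1, 7), repeat=2))
spec = empirical_specification(sample, centers, [(0, -1), (0, 1)])
```

Blocks that never occur in the sample get no entry; each observed block yields a probability vector over the symbols seen at its center.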
3. The typicality result. 

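Proposition 3.1. Simultaneously for all Markov neighborhoods $\Gamma$ with $r(\Gamma)\le\alpha^{1/(2d)}\log^{1/(2d)}|\bar\Lambda_n|$ and blocks $a(\Gamma,0)\in A^{\Gamma\cup\{0\}}$,

$$\left|\frac{N_n(a(\Gamma,0))}{N_n(a(\Gamma))} - Q(a(0)\mid a(\Gamma))\right| < \sqrt{\frac{\kappa\log N_n(a(\Gamma))}{N_n(a(\Gamma))}}$$

eventually almost surely as $n\to\infty$, if

$$0<\alpha\le 1, \qquad \kappa > 2^{3d}e\alpha\log(|A|^2+1).$$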
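
To get a feeling for the size of the constants, the threshold for $\kappa$ and the resulting deviation bound $\sqrt{\kappa\log N/N}$ can be evaluated numerically. This is only an illustrative sketch; the function names and the chosen setting (binary alphabet, $d=2$, $\alpha=1$, $\kappa$ one percent above its threshold) are assumptions.

```python
import math


def kappa_lower_bound(d, alpha, alphabet_size):
    """The threshold 2^{3d} * e * alpha * log(|A|^2 + 1) that kappa must exceed."""
    return 2 ** (3 * d) * math.e * alpha * math.log(alphabet_size ** 2 + 1)


def typicality_width(kappa, n_context):
    """Allowed deviation sqrt(kappa * log N / N) for a block observed N times."""
    return math.sqrt(kappa * math.log(n_context) / n_context)


# Hypothetical setting: binary alphabet on Z^2 with alpha = 1.
kappa = 1.01 * kappa_lower_bound(d=2, alpha=1.0, alphabet_size=2)
widths = [typicality_width(kappa, n) for n in (10**4, 10**6, 10**8)]
```

In this setting the threshold is roughly 280, so the bound is still about 0.5 for a block seen $10^4$ times and becomes informative only for much larger counts; the proposition is an asymptotic statement.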






To prove this proposition we will use an idea similar to the "coding technique" of Besag [3]; namely, we partition $\bar\Lambda_n$ into subsets $\bar\Lambda_n^k$ such that the random variables at the sites $i\in\bar\Lambda_n^k$ are conditionally independent given the values of those at the other sites. First we introduce some further notation. Let

(3.4) $$R_n = \bigl\lfloor \alpha^{1/(2d)}\lceil\log|\bar\Lambda_n|\rceil^{1/(2d)}\bigr\rfloor.$$

We partition the region $\bar\Lambda_n$ by intersecting it with sublattices of $\mathbb{Z}^d$ such that the distance between sites in a sublattice is $4R_n+1$. The intersections of $\bar\Lambda_n$ with these sublattices will be called sieves. Indexed by the offset $k$ relative to the origin $0$, the sieves are

$$\bar\Lambda_n^k = \{i\in\bar\Lambda_n : i = k+(4R_n+1)v,\ v\in\mathbb{Z}^d\}, \qquad \|k\|\le 2R_n;$$

see Figure 2. For a neighborhood $\Gamma$, let $N_n^k(a(\Gamma))$ denote the number of occurrences of the block $a(\Gamma)\in A^{\Gamma}$ in the sample $x(\Lambda_n)$ with center in $\bar\Lambda_n^k$,

$$N_n^k(a(\Gamma)) = |\{i\in\bar\Lambda_n^k : \Gamma^i\subseteq\Lambda_n,\ x(\Gamma^i)=a(\Gamma)\}|.$$

Similarly, let

$$N_n^k(a(\Gamma,0)) = |\{i\in\bar\Lambda_n^k : \Gamma^i\subseteq\Lambda_n,\ x(\Gamma^i\cup\{i\})=a(\Gamma,0)\}|.$$

Clearly,

$$N_n(a(\Gamma)) = \sum_{k:\,\|k\|\le 2R_n} N_n^k(a(\Gamma)) \quad\text{and}\quad N_n(a(\Gamma,0)) = \sum_{k:\,\|k\|\le 2R_n} N_n^k(a(\Gamma,0)).$$

The notation $a(\Gamma)\in x(\bar\Lambda_n^k)$ will mean that $N_n^k(a(\Gamma))\ge 1$.
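
The partition into sieves amounts to grouping sites by their coordinates modulo $4R_n+1$. A minimal sketch, assuming the region is a set of integer tuples and the radius is passed as a plain integer `R` (the function name and the square test region are hypothetical):

```python
from collections import defaultdict
from itertools import product


def sieves(region, R):
    """Split a region into sieves: classes of sites congruent modulo 4R + 1.

    Each class is keyed by its offset k, chosen with every coordinate in [-2R, 2R].
    """
    step = 4 * R + 1
    parts = defaultdict(set)
    for i in region:
        k = tuple((c + 2 * R) % step - 2 * R for c in i)
        parts[k].add(i)
    return dict(parts)


# Hypothetical region: a 10 x 10 square in Z^2 with R = 1, hence spacing 5.
region = set(product(range(10), repeat=2))
parts = sieves(region, R=1)
```

Two distinct sites of the same sieve differ by at least $4R+1$ in the maximum norm, which is what keeps their neighborhoods of radius at most $R$ well separated.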

Denote by $\Phi_n(\Gamma)$ the set of sites outside the neighborhood $\Gamma$ whose norm is at most $2R_n$,

$$\Phi_n(\Gamma) = \{i\in\mathbb{Z}^d : \|i\|\le 2R_n,\ i\notin\Gamma\};$$

see Figure 2. $\Phi_n^i(\Gamma)$ denotes the translate of $\Phi_n(\Gamma)$ when $0$ is translated to $i$.

For a finite region $\Xi\subset\mathbb{Z}^d$, conditional probabilities on the condition $X(\Xi)=x(\Xi)\in A^{\Xi}$ will be denoted briefly by $\mathrm{Prob}\{\cdot\mid x(\Xi)\}$.

In the following lemma the neighborhoods T need not be Markov neigh- 
borhoods. 

Lemma 3.1. Simultaneously for all sieves $k$, neighborhoods $\Gamma$ with $r(\Gamma)\le R_n$ and blocks $a(\Gamma)\in A^{\Gamma}$,

$$(1+\varepsilon)\log N_n^k(a(\Gamma)) \ge \log|\bar\Lambda_n|,$$

eventually almost surely as $n\to\infty$, where $\varepsilon>0$ is an arbitrary constant.



Fig. 2. The sieve $\bar\Lambda_n^k$.



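Proof. As a consequence of Lemma 2.1, for any fixed sieve $k$ and neighborhood $\Gamma$ with $r(\Gamma)\le R_n$, the random variables $X(\Gamma^i)$, $i\in\bar\Lambda_n^k$, are conditionally independent given the values of the random variables in the rest of the sites of the sample region $\Lambda_n$. By Lemma A.5 in the Appendix,

$$Q(a(\Gamma)\mid a(\Phi_n(\Gamma))) \ge q_{\min}^{|\Gamma|}, \qquad a(\Phi_n(\Gamma))\in A^{\Phi_n(\Gamma)};$$

hence we can use the large deviation theorem of Lemma A.3 in the Appendix, applied with the probability bound $q_{\min}^{|\Gamma|}$, to obtain

$$\mathrm{Prob}\Bigl\{\frac{N_n^k(a(\Gamma))}{|\bar\Lambda_n^k|} < \frac12 q_{\min}^{|\Gamma|} \Bigm| x\Bigl(\Lambda_n\setminus\bigcup_{i\in\bar\Lambda_n^k}\Gamma^i\Bigr)\Bigr\} \le \exp\Bigl[-|\bar\Lambda_n^k|\frac{q_{\min}^{|\Gamma|}}{16}\Bigr].$$

Hence also for the unconditional probabilities,

$$\mathrm{Prob}\Bigl\{\frac{N_n^k(a(\Gamma))}{|\bar\Lambda_n^k|} < \frac12 q_{\min}^{|\Gamma|}\Bigr\} \le \exp\Bigl[-|\bar\Lambda_n^k|\frac{q_{\min}^{|\Gamma|}}{16}\Bigr].$$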


Note that for n > no (not depending on fe) we have 



|A fc | > 



\Ar. 



> 



\A r . 



2 (4R n + l) d (5i2n) 



Using this and the consequence $|\Gamma|\le(2R_n+1)^d\le(3R_n)^d$ of $r(\Gamma)\le R_n$, the last probability bound implies for $n\ge n_0$

$$\mathrm{Prob}\Bigl\{\frac{N_n^k(a(\Gamma))}{|\bar\Lambda_n|} < \frac{q_{\min}^{(3R_n)^d}}{2(5R_n)^d}\Bigr\} \le \exp\Bigl[-|\bar\Lambda_n|\frac{q_{\min}^{(3R_n)^d}}{16(5R_n)^d}\Bigr].$$

Using the union bound and Lemma A.6 in the Appendix, it follows that

$$\mathrm{Prob}\Bigl\{\frac{N_n^k(a(\Gamma))}{|\bar\Lambda_n|} < \frac{q_{\min}^{(3R_n)^d}}{2(5R_n)^d} \text{ for some } k,\Gamma,a(\Gamma) \text{ with } \|k\|\le 2R_n,\ r(\Gamma)\le R_n,\ a(\Gamma)\in A^{\Gamma}\Bigr\}$$

$$\le \exp\Bigl[-|\bar\Lambda_n|\frac{q_{\min}^{(3R_n)^d}}{16(5R_n)^d}\Bigr]\cdot(4R_n+1)^d\cdot(|A|^2+1)^{(2R_n+1)^d/2}.$$

Recalling (3.4), this is summable in $n$, and thus the Borel-Cantelli lemma gives

$$N_n^k(a(\Gamma)) \ge |\bar\Lambda_n|\,\frac{q_{\min}^{3^d\alpha^{1/2}(1+\log|\bar\Lambda_n|)^{1/2}}}{2\cdot 5^d\alpha^{1/2}(1+\log|\bar\Lambda_n|)^{1/2}},$$

eventually almost surely as $n\to\infty$, simultaneously for all sieves $k$, neighborhoods $\Gamma$ with $r(\Gamma)\le R_n$ and blocks $a(\Gamma)\in A^{\Gamma}$. This proves the lemma. □

Lemma 3.2. Simultaneously for all sieves $k$, Markov neighborhoods $\Gamma$ with $r(\Gamma)\le R_n$ and blocks $a(\Gamma,0)\in A^{\Gamma\cup\{0\}}$,

$$\left|\frac{N_n^k(a(\Gamma,0))}{N_n^k(a(\Gamma))} - Q(a(0)\mid a(\Gamma))\right| < \sqrt{\frac{\delta\log^{1/2}N_n^k(a(\Gamma))}{N_n^k(a(\Gamma))}}$$

eventually almost surely as $n\to\infty$, if

$$\delta > 2^d e\alpha^{1/2}\log(|A|^2+1).$$

Proof. Given a sieve $k$, a Markov neighborhood $\Gamma$ and a block $a(\Gamma,0)$, the difference $N_n^k(a(\Gamma,0)) - N_n^k(a(\Gamma))Q(a(0)\mid a(\Gamma))$ equals

$$Y_n = \sum_{i\in\bar\Lambda_n^k:\, x(\Gamma^i)=a(\Gamma)} \bigl[I(X(i)=a(0)) - Q(a(0)\mid a(\Gamma))\bigr],$$

where $I(\cdot)$ denotes the indicator function; hence the claimed inequality is equivalent to

$$-\sqrt{N_n^k(a(\Gamma))\,\delta\log^{1/2}N_n^k(a(\Gamma))} < Y_n < \sqrt{N_n^k(a(\Gamma))\,\delta\log^{1/2}N_n^k(a(\Gamma))}.$$

We will prove that the last inequalities hold eventually almost surely as $n\to\infty$, simultaneously for all sieves $k$, Markov neighborhoods $\Gamma$ with $r(\Gamma)\le R_n$ and blocks $a(\Gamma,0)\in A^{\Gamma\cup\{0\}}$. We concentrate on the second inequality; the proof for the first one is similar.

Denote

G j (k,a(T,0)) = \ max y n >^e%-V2\ 

where 

J\fj(k,a(T)) = {n:e> < N^(a{T)) < e j+1 ,(l + e) logiV*(o(r)) > log|A n |}; 
if neNj(k,a(T)), then by (3.4) 

i?„ = [a 1 /^) [log |A n |] V(2d)j < i/(M) (1 + (1 + e)(j + 1)} i/(2d) drf ^ 

(3-5) 

The claimed inequality Y n < \/ N^(a(T))5 log 1 ^ 2 N%(a(T)) holds for each 
n with e J ' < JV£(a(T)) < e j+1 if 



max y n < \ I e^dj 1 / 2 . 

n: ei<Arfe(a(r))<eJ+ 1 v 

By Lemma 3.1, the condition (1 + e) logiV*(a(r)) > log|A n | in the definition 
of A/j(fc, a(r)) is satisfied eventually almost surely, simultaneously for all 
sieves k, neighborhoods F with r(F) < R n and blocks a(F) G j4 r . Hence it 
suffices to prove that the following holds with probability 1: the union of the 
events Gj(k,a{T,0)) for all k with ||fe|| < 2R^\ all T D T with r(T) < 
and all a(r,0) G J 4 ru 'f J', obtains only for finitely many j. 
As n G Mj(k, a(T)) implies j < log |A n | < (1 + e)(j + 1), 

L(i+e)(?+i)J f . 

(3.6) a(I\0))C |J max Y n >\J eiSj 1 / 2 L 

l_neA/j,i(fc,a(r)) J 

where 

Nj,l(k, o(r)) = {n : e J ' < iV*(a(r)) < e»' +1 , Z < log |A n | < / + 1}. 

The random variables X(i), i G A^, are conditionally independent given 
the values of the random variables in their r-neighborhoods. Moreover, 
those A(i)'s for which the same block a(F) appears in their T-neighborhood 
are also conditionally i.i.d. Hence Y n is the sum of N^(a(F)) conditionally 
i.i.d. random variables with mean and variance 

\ > D 2 = Q(a(0)|o(r))[l - Q(o(0)|o(T))] > ± 9min . 

As $R_n$ is constant for $n$ with $l<\log|\bar\Lambda_n|\le l+1$, the corresponding $Y_n$'s are actually partial sums of a sequence of $N_{n^*}^k(a(\Gamma))\le e^{j+1}$ such conditionally i.i.d. random variables, where $n^*$ is the largest element of $\mathcal{N}_{j,l}(k,a(\Gamma))$. Therefore, using Lemma A.4 in the Appendix with $\mu=\mu_j=(1-\eta)\sqrt{e^{-1}\delta j^{1/2}}$, where $\eta>0$ is an arbitrary constant, we have



Prob< max Y n > \ e^dj 1 / 2 

lneA/i,i(fc,a(r)) v 



U r 

VieA^ :x(r i )=a(r) 



< Prob< max 



U r l 

VieA*i:x(r i )=a(r) 



<-exp 



2(1+ H /(2DV^ t )) 2 . 

On account of linij_ >00 fij/(2DVei +1 ) = 0, the last bound can be continued 
for j >j , as 



<-exp 



1/2 



2e(l + r?) 

This bound also holds for the unconditional probabilities, hence we obtain 
from (3.6), 

U-r?) 2 



Prob{Gj(k, a(T, 0))} < (ej + 2) • - exp 



2e(l + 7?) 



Sj 



1/2 



< 



cxp 



2e(l + ?7) 



1/2 



To bound the number of all admissible k, T, a(T,0) [recall the conditions 
H&H < r(T) < R®, with R& defined in (3.5)], note that the number 

of possible fc's is bounded by 

{4RM + l) d < (4 + p) d a l l 2 {\ + e) l ' 2 {j + l) 1 / 2 , 

and, by Lemma A. 6 in the Appendix, the number of possible blocks a(T, 0) 
with r(r) < is bounded by 

(U| 2 + 1)( 2R °' ) + 1 )'7 2 < (l^ia + 1 \(i+p) ,, 2'«- 1 a 1 / a (l+ e ) 1 / a (j+l) 1 / 2 _ 

Combining the above bounds, we get for the probability of the union of the events $G_j(k,a(\Gamma,0))$ for all admissible $k$, $\Gamma$, $a(\Gamma,0)$ the bound

$$\exp\Bigl[-\frac{(1-\eta)^2}{2e(1+\eta)}\delta j^{1/2} + [\log(|A|^2+1)](1+\rho)^d 2^{d-1}\alpha^{1/2}(1+\varepsilon)^{1/2}(j+1)^{1/2} + O(\log j^{1/2})\Bigr].$$

This is summable in $j$ if we choose $\eta$, $\varepsilon$, $\rho$ sufficiently small, and $\delta/(2e) > 2^{d-1}\alpha^{1/2}\log(|A|^2+1)$, that is, if $\delta > 2^d e\alpha^{1/2}\log(|A|^2+1)$. □

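Proof of Proposition 3.1. Using Lemma 3.2,

$$\left|\frac{N_n(a(\Gamma,0))}{N_n(a(\Gamma))} - Q(a(0)\mid a(\Gamma))\right| \le \sum_{k:\,\|k\|\le 2R_n}\left|\frac{N_n^k(a(\Gamma,0))}{N_n^k(a(\Gamma))} - Q(a(0)\mid a(\Gamma))\right|\frac{N_n^k(a(\Gamma))}{N_n(a(\Gamma))}$$

$$< \sum_{k:\,\|k\|\le 2R_n}\sqrt{\frac{\delta\log^{1/2}N_n^k(a(\Gamma))}{N_n^k(a(\Gamma))}}\cdot\frac{N_n^k(a(\Gamma))}{N_n(a(\Gamma))}$$

eventually almost surely as $n\to\infty$. By Jensen's inequality and $N_n^k(a(\Gamma))\le N_n(a(\Gamma))$, this can be continued as

$$\le \sqrt{\frac{\delta(4R_n+1)^d\log^{1/2}N_n(a(\Gamma))}{N_n(a(\Gamma))}}.$$

By (3.4) and Lemma 3.1, we have for any $\varepsilon,\rho>0$ and $n$ sufficiently large,

$$(4R_n+1)^d \le \bigl(4\alpha^{1/(2d)}(1+\log|\bar\Lambda_n|)^{1/(2d)}+1\bigr)^d \le (4+\rho)^d\alpha^{1/2}(1+\varepsilon)^{1/2}\log^{1/2}N_n(a(\Gamma)),$$

eventually almost surely as $n\to\infty$. This completes the proof. □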

4. The overestimation.

Proposition 4.1. Eventually almost surely as $n\to\infty$,

$$\hat\Gamma_{\mathrm{PIC}}(x(\Lambda_n)) \notin \{\Gamma : \Gamma\supset\Gamma_0\},$$

whenever $r_n$ in Theorem 2.1 is equal to $R_n$ in (3.4) with

$$\alpha < \frac{q_{\min}(|A|-1)}{2^{3d}e|A|^2\log(|A|^2+1)}.$$



Proof. We have to prove that simultaneously for all neighborhoods $\Gamma\supset\Gamma_0$ with $r(\Gamma)\le R_n$,

(4.7) $$\mathrm{PIC}_\Gamma(x(\Lambda_n)) - \mathrm{PIC}_{\Gamma_0}(x(\Lambda_n)) > 0,$$

eventually almost surely as $n\to\infty$.

The left-hand side

$$-\log\mathrm{MPL}_\Gamma(x(\Lambda_n)) + |A|^{|\Gamma|}\log|\Lambda_n| + \log\mathrm{MPL}_{\Gamma_0}(x(\Lambda_n)) - |A|^{|\Gamma_0|}\log|\Lambda_n|$$

is bounded below by

$$-\log\mathrm{MPL}_\Gamma(x(\Lambda_n)) + \log\mathrm{PL}_{\Gamma_0}(x(\Lambda_n),Q_{\Gamma_0}) + \Bigl(1-\frac{1}{|A|}\Bigr)|A|^{|\Gamma|}\log|\Lambda_n|.$$

Hence, it suffices to show that simultaneously for all neighborhoods $\Gamma\supset\Gamma_0$ with $r(\Gamma)\le R_n$,

(4.8) $$\log\mathrm{MPL}_\Gamma(x(\Lambda_n)) - \log\mathrm{PL}_{\Gamma_0}(x(\Lambda_n),Q_{\Gamma_0}) < \frac{|A|-1}{|A|}|A|^{|\Gamma|}\log|\Lambda_n|,$$

eventually almost surely as $n\to\infty$.

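Now, for $\Gamma\supset\Gamma_0$ we have $\mathrm{PL}_{\Gamma_0}(x(\Lambda_n),Q_{\Gamma_0}) = \mathrm{PL}_\Gamma(x(\Lambda_n),Q_\Gamma)$, by the definition (2.2) of pseudo-likelihood, since $\Gamma_0$ is a Markov neighborhood. Thus the left-hand side of (4.8) equals

$$\log\mathrm{MPL}_\Gamma(x(\Lambda_n)) - \log\mathrm{PL}_\Gamma(x(\Lambda_n),Q_\Gamma) = \sum_{a(\Gamma,0)\in x(\Lambda_n)} N_n(a(\Gamma,0))\log\frac{N_n(a(\Gamma,0))/N_n(a(\Gamma))}{Q(a(0)\mid a(\Gamma))}$$

$$= \sum_{a(\Gamma)\in x(\Lambda_n)} N_n(a(\Gamma)) \sum_{a(0):\,a(\Gamma,0)\in x(\Lambda_n)} \frac{N_n(a(\Gamma,0))}{N_n(a(\Gamma))}\log\frac{N_n(a(\Gamma,0))/N_n(a(\Gamma))}{Q(a(0)\mid a(\Gamma))}.$$

To bound the last expression, we use Proposition 3.1 and Lemma A.7 in the Appendix, the latter applied with $P(a(0)) = N_n(a(\Gamma,0))/N_n(a(\Gamma))$, $Q(a(0)) = Q(a(0)\mid a(\Gamma))$. Thus we obtain the upper bound

$$\sum_{a(\Gamma)\in x(\Lambda_n)} N_n(a(\Gamma))\frac{1}{q_{\min}} \sum_{a(0):\,a(\Gamma,0)\in x(\Lambda_n)} \Bigl[\frac{N_n(a(\Gamma,0))}{N_n(a(\Gamma))} - Q(a(0)\mid a(\Gamma))\Bigr]^2$$

$$< \sum_{a(\Gamma)\in x(\Lambda_n)} N_n(a(\Gamma))\frac{1}{q_{\min}}|A|\frac{\kappa\log N_n(a(\Gamma))}{N_n(a(\Gamma))} \le \frac{\kappa|A|}{q_{\min}}|A|^{|\Gamma|}\log|\bar\Lambda_n|,$$

eventually almost surely as $n\to\infty$, simultaneously for all neighborhoods $\Gamma\supset\Gamma_0$ with $r(\Gamma)\le R_n$.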

Hence, since $|\bar\Lambda_n|/|\Lambda_n|\to 1$, the assertion (4.8) holds whenever

$$\kappa < q_{\min}\frac{|A|-1}{|A|^2},$$

which is equivalent to the bound on $\alpha$ in Proposition 4.1. □
5. The underestimation.

Proposition 5.1. Eventually almost surely as $n\to\infty$,

$$\hat\Gamma_{\mathrm{PIC}}(x(\Lambda_n)) \in \{\Gamma : \Gamma\supseteq\Gamma_0\}$$

if $r_n$ in Theorem 2.1 is chosen as in Proposition 4.1.

Proposition 5.1 will be proved using the lemmas below. Let us denote 



Lemma 5.1. The assertion of Proposition 3.1 holds also with $\Gamma$ replaced by $\Gamma\cup\Psi_0$, where $\Gamma$ is any (not necessarily Markov) neighborhood.

Proof. As Proposition 3.1 was a consequence of Lemma 3.2, we have to check that the proof of that lemma works when the Markov neighborhood $\Gamma$ is replaced by $\Gamma\cup\Psi_0$, where $\Gamma$ is any neighborhood. To this end, it suffices to show that conditional on the values of all random variables in the $(\Gamma\cup\Psi_0)$-neighborhoods of the sites $i\in\bar\Lambda_n^k$, those $X(i)$, $i\in\bar\Lambda_n^k$, are conditionally i.i.d. for which the same block $a(\Gamma\cup\Psi_0)$ appears in the $(\Gamma\cup\Psi_0)$-neighborhood of $i$. This follows from Lemma A.1 in the Appendix, with $\Delta=\Gamma_0\cup\{0\}$ and $\Psi=\Psi_0$. □

Lemma 5.2. Simultaneously for all neighborhoods $\Gamma\not\supseteq\Gamma_0$ with $r(\Gamma)\le R_n$,

$$\mathrm{PIC}_{\Gamma\cup\Psi_0}(x(\Lambda_n)) > \mathrm{PIC}_{(\Gamma\cap\Gamma_0)\cup\Psi_0}(x(\Lambda_n)),$$

eventually almost surely as $n\to\infty$.

Proof. The claimed inequality is analogous to (4.7) in the proof of Proposition 4.1, the role of $\Gamma\supset\Gamma_0$ there played by $\Gamma\cup\Psi_0\supset(\Gamma\cap\Gamma_0)\cup\Psi_0$. Its proof is the same as that of (4.7), using Lemma 5.1 instead of Proposition 3.1. Indeed, the basic neighborhood property of $\Gamma_0$ was used in that proof only to show that $\mathrm{PL}_{\Gamma_0}(x(\Lambda_n),Q_{\Gamma_0}) = \mathrm{PL}_\Gamma(x(\Lambda_n),Q_\Gamma)$. The analogue of this identity, namely

$$\mathrm{PL}_{(\Gamma\cap\Gamma_0)\cup\Psi_0}(x(\Lambda_n),Q_{(\Gamma\cap\Gamma_0)\cup\Psi_0}) = \mathrm{PL}_{\Gamma\cup\Psi_0}(x(\Lambda_n),Q_{\Gamma\cup\Psi_0}),$$

follows from Lemma A.1 in the Appendix with $\Delta=\Gamma_0\cup\{0\}$ and $\Psi=\Psi_0$. □

For the next lemma, we introduce some further notation. 
The set of all probability distributions on $A^{\mathbb{Z}^d}$, equipped with the weak topology, is a compact Polish space; let $d$ denote a metric that metrizes it. Let $\mathcal{Q}^G$ denote the (compact) set of Gibbs distributions with the one-point specification $Q_{\Gamma_0}$.

For a sample $x(\Lambda_n)$, define the empirical distribution on $A^{\mathbb{Z}^d}$ by

$$R_{x,n} = \frac{1}{|\bar\Lambda_n|}\sum_{i\in\bar\Lambda_n}\delta_{x_n^i},$$

where $x_n\in A^{\mathbb{Z}^d}$ is the extension of the sample $x(\Lambda_n)$ to the whole lattice with $x_n(j)$ equal to a constant $a\in A$ for $j\in\mathbb{Z}^d\setminus\Lambda_n$, and $x_n^i$ denotes the translate of $x_n$ when $0$ is translated to $i$ and $\delta_x$ is the Dirac mass at $x\in A^{\mathbb{Z}^d}$.

Lemma 5.3. With probability 1, $d(R_{x,n},\mathcal{Q}^G)\to 0$.

Proof. Fix a realization $x(\mathbb{Z}^d)$ for which Proposition 3.1 holds. It suffices to show that for any subsequence $n_k$ such that $R_{x,n_k}$ converges, its limit $R_{x,0}$ belongs to $\mathcal{Q}^G$.

Let $\Gamma'$ be any neighborhood. For $n$ sufficiently large, the $(\Gamma'\cup\{0\})$-marginal of $R_{x,n}$ is equal to

$$\Bigl\{\frac{N_n(a(\Gamma',0))}{|\bar\Lambda_n|} : a(\Gamma',0)\in A^{\Gamma'\cup\{0\}}\Bigr\};$$

hence $R_{x,n_k}\to R_{x,0}$ implies

(5.9) $$\frac{N_{n_k}(a(\Gamma',0))}{|\bar\Lambda_{n_k}|} \to R_{x,0}(a(\Gamma',0))$$

for all $a(\Gamma',0)\in A^{\Gamma'\cup\{0\}}$. This and summation for $a(0)\in A$ imply

$$\frac{N_{n_k}(a(\Gamma'))}{|\bar\Lambda_{n_k}|} \to R_{x,0}(a(\Gamma')).$$

As Proposition 3.1 holds for the realization $x(\mathbb{Z}^d)$, it follows that if $\Gamma'$ is a Markov neighborhood, then

$$R_{x,0}(a(0)\mid a(\Gamma')) = Q(a(0)\mid a(\Gamma')) = Q_{\Gamma_0}(a(0)\mid a(\Gamma_0)).$$

For any finite region $\Delta\supseteq\Gamma_0$ with $0\notin\Delta$, the last equation for a neighborhood $\Gamma'\supseteq\Delta$ implies that

$$R_{x,0}(a(0)\mid a(\Delta)) = Q_{\Gamma_0}(a(0)\mid a(\Gamma_0)) \quad\text{if } \Delta\supseteq\Gamma_0,\ 0\notin\Delta.$$

To prove $R_{x,0}\in\mathcal{Q}^G$ it remains to show that, in addition, $R_{x,0}(a(i)\mid a(\Delta^i)) = Q_{\Gamma_0}(a(i)\mid a(\Gamma_0^i))$. Actually, we show that $R_{x,0}$ is translation invariant. Indeed, given a finite region $\Delta\subset\mathbb{Z}^d$ and its translate $\Delta^i$, take a neighborhood $\Gamma'$ with $\Delta\cup\Delta^i\subseteq\Gamma'\cup\{0\}$, and consider the sum of the counts $N_n(a(\Gamma',0))$ for all blocks $a(\Gamma',0)=\{a(j): j\in\Gamma'\cup\{0\}\}$ with $\{a(j): j\in\Delta\}$ equal to a fixed $|\Delta|$-tuple and the similar sum with $\{a(j): j\in\Delta^i\}$ equal to the same $|\Delta|$-tuple. If $r(\Gamma')\le\log^{1/(2d)}|\Lambda_n|$, the difference of these sums is at most $|\Lambda_n|-|\bar\Lambda_n|$, hence the translation invariance of $R_{x,0}$ follows by (5.9). □

Lemma 5.4. Uniformly for all neighborhoods $\Gamma$ not containing $\Gamma_0$,

$$-\log\mathrm{MPL}_{(\Gamma\cap\Gamma_0)\cup\Psi_0}(x(\Lambda_n)) > -\log\mathrm{MPL}_{\Gamma_0}(x(\Lambda_n)) + c|\bar\Lambda_n|,$$

eventually almost surely as $n\to\infty$, where $c>0$ is a constant.

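Proof. Given a realization $x\in A^{\mathbb{Z}^d}$ with the property in Lemma 5.3, there exists a sequence $Q_{x,n}$ in $\mathcal{Q}^G$ with

$$d(R_{x,n},Q_{x,n})\to 0,$$

and consequently

(5.10) $$\frac{N_n(a(\Delta))}{|\bar\Lambda_n|} - Q_{x,n}(a(\Delta)) \to 0$$

for each finite region $\Delta\subset\mathbb{Z}^d$ and $a(\Delta)\in A^{\Delta}$.

Next, let $\Gamma$ be a neighborhood with $\Gamma\not\supseteq\Gamma_0$. By (2.3),

$$-\frac{1}{|\bar\Lambda_n|}\log\mathrm{MPL}_{(\Gamma\cap\Gamma_0)\cup\Psi_0}(x(\Lambda_n)) = -\frac{1}{|\bar\Lambda_n|}\sum_{a((\Gamma\cap\Gamma_0)\cup\Psi_0,0)\in x(\Lambda_n)} N_n(a((\Gamma\cap\Gamma_0)\cup\Psi_0,0))\log\frac{N_n(a((\Gamma\cap\Gamma_0)\cup\Psi_0,0))}{N_n(a((\Gamma\cap\Gamma_0)\cup\Psi_0))}.$$

Applying (5.10) to $\Delta=(\Gamma\cap\Gamma_0)\cup\Psi_0\cup\{0\}$, it follows that the last expression is arbitrarily close to

$$-\sum_{a((\Gamma\cap\Gamma_0)\cup\Psi_0\cup\{0\})} Q_{x,n}(a((\Gamma\cap\Gamma_0)\cup\Psi_0,0))\log Q_{x,n}(a(0)\mid a((\Gamma\cap\Gamma_0)\cup\Psi_0)) = H_{Q_{x,n}}(X(0)\mid X((\Gamma\cap\Gamma_0)\cup\Psi_0))$$

if $n$ is sufficiently large, where $H_{Q_{x,n}}(\cdot\mid\cdot)$ denotes conditional entropy, when the underlying distribution is $Q_{x,n}$. Similarly, $-(1/|\bar\Lambda_n|)\log\mathrm{MPL}_{\Gamma_0}(x(\Lambda_n))$ is arbitrarily close to $H_{Q_{x,n}}(X(0)\mid X(\Gamma_0))$, which equals $H_{Q_{x,n}}(X(0)\mid X(\Gamma_0\cup\Psi_0))$ since $\Gamma_0$ is a Markov neighborhood.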

It is known that $H_{Q'}(X(0)\mid X((\Gamma\cap\Gamma_0)\cup\Psi_0)) \ge H_{Q'}(X(0)\mid X(\Gamma_0\cup\Psi_0))$ for any distribution $Q'$. The proof of the lemma will be complete if we show that, in addition, there exists a constant $\xi>0$ (depending on $\Gamma\cap\Gamma_0$) such that for every Gibbs distribution $Q^G\in\mathcal{Q}^G$

$$H_{Q^G}(X(0)\mid X((\Gamma\cap\Gamma_0)\cup\Psi_0)) - H_{Q^G}(X(0)\mid X(\Gamma_0\cup\Psi_0)) > \xi.$$

The indirect assumption that the left-hand side goes to 0 for some sequence of Gibbs distributions in $\mathcal{Q}^G$ implies, using the compactness of $\mathcal{Q}^G$, that

$$H_{Q_0^G}(X(0)\mid X((\Gamma\cap\Gamma_0)\cup\Psi_0)) = H_{Q_0^G}(X(0)\mid X(\Gamma_0\cup\Psi_0)),$$

for the limit $Q_0^G\in\mathcal{Q}^G$ of a convergent subsequence. This equality implies

$$Q_0^G(a(0)\mid a((\Gamma\cap\Gamma_0)\cup\Psi_0)) = Q_0^G(a(0)\mid a(\Gamma_0\cup\Psi_0))$$

for all $a(0)\in A$, $a(\Gamma_0\cup\Psi_0)\in A^{\Gamma_0\cup\Psi_0}$. By Lemma A.1 in the Appendix, these conditional probabilities are uniquely determined by the one-point specification $Q_{\Gamma_0}$, and the last equality implies

$$Q(a(i)\mid a((\Gamma\cap\Gamma_0)^i\cup\Psi_0^i)) = Q(a(i)\mid a(\Gamma_0^i\cup\Psi_0^i)) = Q_{\Gamma_0}(a(i)\mid a(\Gamma_0^i)).$$

According to Lemma A.2 in the Appendix, this would imply that $(\Gamma\cap\Gamma_0)\cup\Psi_0$ is a Markov neighborhood also, which is a contradiction, as $(\Gamma\cap\Gamma_0)\cup\Psi_0\not\supseteq\Gamma_0$. This completes the proof of the lemma because there is only a finite number of possible intersections $\Gamma\cap\Gamma_0$. □

Proof of Proposition 5.1. We have to show that 
(5.11) PIC r (x(A n )) > PIC ro (x(A n )), 

eventually almost surely as n — > oo, for all neighborhoods T with r(T) < R n 
that do not contain Tq. 

Note that T 1 D T 2 implies MPL ri (x(A n )) > MPLr 2 (x(A n )), since 
MPLr(x(A n )) is the maximizer in Q' r of PLr(x(A n ), Qp); see (2.2). Hence 

-logMPL r (x(A n )) > -logMPL ru ^ (x(A n )) 

for any neighborhood T. 
Thus 

PIC r (x(A n )) = -logMPL r (x(A n )) + \Af\ log |A n | 

> PIC ru * (x(A n )) - (|yl|l ru *«l - \A\^) log |A n |. 

Using Lemma 5.2 and the obvious bound $|\Gamma\cup\Psi_0|\le|\Gamma|+|\Psi_0|$, it follows
that, eventually almost surely as $n\to\infty$ for all $\Gamma\not\supseteq\Gamma_0$ with $r(\Gamma)\le R_n$,

$$\mathrm{PIC}_{\Gamma}(x(\Lambda_n)) > \mathrm{PIC}_{(\Gamma\cap\Gamma_0)\cup\Psi_0}(x(\Lambda_n)) - |A|^{|\Gamma|}\bigl(|A|^{|\Psi_0|}-1\bigr)\log|\Lambda_n|.$$

Here, by Lemma 5.4,

$$\mathrm{PIC}_{(\Gamma\cap\Gamma_0)\cup\Psi_0}(x(\Lambda_n)) > -\log\mathrm{MPL}_{(\Gamma\cap\Gamma_0)\cup\Psi_0}(x(\Lambda_n)) > -\log\mathrm{MPL}_{\Gamma_0}(x(\Lambda_n)) + c|\Lambda_n|,$$

eventually almost surely as $n\to\infty$ for all $\Gamma$ as above. This completes the
proof, since the conditions $r(\Gamma)\le R_n$ and $|\Lambda_n|/|\bar\Lambda_n|\to 1$ imply
$|A|^{|\Gamma|}\log|\Lambda_n| = o(|\Lambda_n|)$. □
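
The growth comparison behind the last step can be illustrated numerically. The following Python sketch evaluates the ratio of the penalty $|A|^{|\Gamma|}\log|\Lambda_n|$ to the sample size for growing samples. It is an illustration only: the function name `penalty_ratio`, the binary alphabet, the natural logarithm and the boundary choice $|\Gamma| = \log^{1/2}|\Lambda_n|$ (slightly more generous than the sizes admitted by the radius condition) are assumptions of the sketch, not part of the proof.

```python
import math

def penalty_ratio(num_sites, alphabet_size=2):
    """Return |A|^{|Gamma|} * log|Lambda_n| / |Lambda_n|.

    Assumption for illustration: the neighborhood size is taken as
    |Gamma| = sqrt(log|Lambda_n|), the boundary case of the size condition.
    """
    log_n = math.log(num_sites)
    gamma_size = math.sqrt(log_n)              # hypothetical neighborhood size
    penalty = alphabet_size ** gamma_size * log_n
    return penalty / num_sites

sizes = [10 ** k for k in range(2, 13, 2)]     # |Lambda_n| = 1e2, 1e4, ..., 1e12
ratios = [penalty_ratio(n) for n in sizes]
for n, r in zip(sizes, ratios):
    print(f"|Lambda_n| = {n:.0e}   penalty / |Lambda_n| = {r:.3e}")
```

The printed ratios decrease towards zero, which is the behaviour the proof relies on: the linear gain $c|\Lambda_n|$ from Lemma 5.4 eventually dominates every penalty difference.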

6. Discussion. A modification of the Bayesian Information Criterion
(BIC) called PIC has been introduced for estimating the basic neighborhood
of a Markov random field on $\mathbb{Z}^d$, with finite alphabet $A$. In this criterion,
the maximum pseudo-likelihood is used instead of the maximum likelihood,
with penalty term $|A|^{|\Gamma|}\log|\Lambda_n|$ for a candidate neighborhood $\Gamma$, where $\Lambda_n$
is the sample region. The minimizer of PIC over candidate neighborhoods,
with radius allowed to grow as $o(\log^{1/(2d)}|\Lambda_n|)$, has been proved to equal the
basic neighborhood eventually almost surely, not requiring any prior bound
on the size of the latter. This result is unaffected by phase transition and
even by nonstationarity of the joint distribution. The same result holds if
the penalty term is multiplied by any $c>0$; the no-underestimation part
(Proposition 5.1) holds also if $\log|\Lambda_n|$ in the penalty term is replaced by
any function of the sample size $|\Lambda_n|$ that tends to infinity and is $o(|\Lambda_n|)$.
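
As a rough illustration of the criterion summarized above, the following Python sketch computes a PIC value for one candidate neighborhood of a two-dimensional field. Several choices are assumptions made for the sketch and are not taken from the paper: the function name `pic`, the restriction to $d=2$, the natural logarithm, the window consisting of all sites at distance at least `radius` from the border, and the use of empirical conditional frequencies as the maximizer of the pseudo-likelihood.

```python
import math
from collections import Counter

def pic(field, neighborhood, radius, alphabet_size):
    """PIC value of a candidate neighborhood for a 2-D field (list of rows).

    neighborhood: list of (di, dj) offsets, assumed symmetric and within
    max-norm distance `radius` of the origin.
    Window (an assumption of this sketch): sites at least `radius` away
    from the border, so every candidate is evaluated on the same sites.
    """
    rows, cols = len(field), len(field[0])
    joint = Counter()    # counts of (neighborhood block, value at the site)
    context = Counter()  # counts of the neighborhood block alone
    for i in range(radius, rows - radius):
        for j in range(radius, cols - radius):
            block = tuple(field[i + di][j + dj] for di, dj in neighborhood)
            joint[(block, field[i][j])] += 1
            context[block] += 1
    # Minus log of the maximized pseudo-likelihood, using empirical
    # conditional frequencies N(block, value) / N(block).
    neg_log_mpl = -sum(n * math.log(n / context[block])
                       for (block, _), n in joint.items())
    penalty = alphabet_size ** len(neighborhood) * math.log(rows * cols)
    return neg_log_mpl + penalty

# Toy data: a 20 x 20 checkerboard, fully determined by nearest neighbors.
board = [[(i + j) % 2 for j in range(20)] for i in range(20)]
nearest = [(-1, 0), (1, 0), (0, -1), (0, 1)]
print(pic(board, [], 1, 2), pic(board, nearest, 1, 2))
```

On this toy field the nearest-neighbor candidate attains the smaller PIC value: its pseudo-likelihood term vanishes, while the empty neighborhood pays $\log 2$ per window site, which outweighs the larger penalty.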

PIC estimation of the basic neighborhood of a Markov random field is to 
a certain extent similar to BIC estimation of the order of a Markov chain, 
and of the context tree of a tree source, also called a variable-length Markov 
chain. For context tree estimation via another method see [5, 22], and via 
BIC, see [9]. There are, however, also substantial differences. The martingale 
techniques in [7, 8] do not appear to carry over to Markov random fields, 
and the lack of an analogue of the Krichevsky-Trofimov distribution used in 
these references is another obstacle. We also note that the "large" boundaries 
of multidimensional sample regions cause side effects not present in the
one-dimensional case; to overcome those, we have defined the pseudo-likelihood
function based on a window $\bar\Lambda_n$ slightly smaller than the whole sample region
$\Lambda_n$.

For Markov order and context tree estimation via BIC, consistency has
been proved by Csiszár and Shields [8] admitting, for sample size $n$, all
$k<n$ as candidate orders (see also [7]), and by Csiszár and Talata [9]
admitting trees of depth $o(\log n)$ as candidate context trees. In our
main result, Theorem 2.1, the PIC estimator of the basic neighborhood is
defined admitting candidate neighborhoods of radius $o(\log^{1/(2d)}|\Lambda_n|)$, thus of
size $o(\log^{1/2}|\Lambda_n|)$. The mentioned one-dimensional results suggest that this
bound on the radius might be relaxed to $o(\log^{1/d}|\Lambda_n|)$, or perhaps dropped
completely. This question remains open, even for the case $d=1$. A positive
answer apparently depends on the possibility of strengthening our typicality
result, Proposition 3.1, to a strength similar to that of the conditional
typicality results for Markov chains in [7].

More important than a possible mathematical sharpening of Theorem
2.1, as above, would be to find an algorithm to determine the PIC estimator
without actually computing and comparing the PIC values of all candidate
neighborhoods. The analogous problem for BIC context tree estimation has
been solved: Csiszár and Talata [9] showed that this BIC estimator can be
computed in linear time via an analogue of the "context tree maximizing 
algorithm" of Willems, Shtarkov and Tjalkens [23, 24]. Unfortunately, a 
similar algorithm for the present problem appears elusive, and it remains 
open whether our estimator can be computed in a "clever" way. 
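
To indicate why exhaustive comparison is costly, the following Python sketch counts the candidate neighborhoods inside a cube of given radius. It assumes, as in Lemma A.6, that neighborhoods are symmetric subsets of the lattice not containing the origin, and that the radius is measured in the maximum norm; the function name `num_candidate_neighborhoods` is hypothetical.

```python
def num_candidate_neighborhoods(radius, dim):
    """Number of symmetric neighborhoods (empty one included) contained in
    the max-norm cube of the given radius around the origin of Z^dim."""
    # Non-origin sites of the cube split into antipodal pairs {i, -i}.
    pairs = ((2 * radius + 1) ** dim - 1) // 2
    # A symmetric neighborhood contains each pair entirely or not at all.
    return 2 ** pairs

for radius in (1, 2, 3):
    print(radius, num_candidate_neighborhoods(radius, dim=2))
```

The count doubles with every additional antipodal pair, so a direct search over all candidates is feasible only for very small radii.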

Finally, we emphasize that the goal of this paper was to provide a con- 
sistent estimator of the basic neighborhood of a Markov random field. Of 
course, consistency is only one of the desirable properties of an estimator. 
To assess the practical performance of this estimator requires further re- 
search, such as studying finite sample size properties, robustness against 
noisy observations and computability with acceptable complexity. 

Note added in proof. Just before completing the galley proofs, we learned 
that model selection for Markov random fields had been addressed before, 
by Ji and Seymour [18]. They used a criterion almost identical to PIC here
and, in a somewhat different setting, proved weak consistency under the 
assumption that the number of candidate model classes is finite. 

APPENDIX 

First we indicate how the well-known facts stated in Lemma 2.1 can be 
formally derived from results in [13], using the concepts defined there. 

Proof of Lemma 2.1. By Theorem 1.33 the positive one-point spec- 
ification uniquely determines the specification, which is positive and local 
on account of the locality of the one-point specification. By Theorem 2.30 
this positive local specification determines a unique "gas" potential (if an 
element of A is distinguished as the zero element). Due to Corollary 2.32, 
this is a nearest-neighbor potential for a graph with vertex set $\mathbb{Z}^d$ defined
there, and $\Gamma_0^i$ is the same as $B(i)\setminus\{i\}$ in that corollary. □

The following lemma is a consequence of the global Markov property. 

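Lemma A.1. Let $\Delta\subset\mathbb{Z}^d$ be a finite region with $0\in\Delta$, and
$\Psi = \bigl(\bigcup_{j\in\Delta}\Gamma_0^j\bigr)\setminus\Delta$. Then for any neighborhood $\Gamma$, the conditional probabilities
$Q(a(i)\mid a(\Gamma^i\cup\Psi^i))$ and $Q(a(i)\mid a((\Gamma^i\cap\Delta^i)\cup\Psi^i))$ are equal and translation
invariant.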

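Proof. Since $\Delta$ and $\Psi$ are disjoint, we have

$$Q(a(i)\mid a(\Gamma^i\cup\Psi^i)) = Q\bigl(a(i)\mid a((\Gamma\cap\Delta)^i\cup(\Psi\cup(\Gamma\setminus\Delta))^i)\bigr)
= \frac{Q\bigl(a(\{i\}\cup(\Gamma\cap\Delta)^i)\mid a((\Psi\cup(\Gamma\setminus\Delta))^i)\bigr)}{Q\bigl(a((\Gamma\cap\Delta)^i)\mid a((\Psi\cup(\Gamma\setminus\Delta))^i)\bigr)},$$

and similarly

$$Q(a(i)\mid a((\Gamma\cap\Delta)^i\cup\Psi^i)) = \frac{Q\bigl(a(\{i\}\cup(\Gamma\cap\Delta)^i)\mid a(\Psi^i)\bigr)}{Q\bigl(a((\Gamma\cap\Delta)^i)\mid a(\Psi^i)\bigr)}.$$

By the global Markov property (see Lemma 2.1), both the numerators and
denominators of these two quotients are equal and translation invariant. □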

The lemma below follows from the definition of Markov neighborhood. 

Lemma A.2. For a Markov random field with basic neighborhood $\Gamma_0$, if
a neighborhood $\Gamma$ satisfies

$$Q(a(i)\mid a(\Gamma^i)) = Q_{\Gamma_0}(a(i)\mid a(\Gamma_0^i))$$

for all $i\in\mathbb{Z}^d$, then $\Gamma$ is a Markov neighborhood.

Proof. We have to show that for any $\Delta\supset\Gamma$

$$Q(a(i)\mid a(\Delta^i)) = Q(a(i)\mid a(\Gamma^i)). \tag{A.1}$$

Since $\Gamma_0$ is a Markov neighborhood, the condition of the lemma implies

$$Q(a(i)\mid a(\Gamma^i)) = Q(a(i)\mid a(\Gamma_0^i)) = Q(a(i)\mid a((\Gamma_0\cup\Delta)^i)).$$

Hence (A.1) follows, because $\Gamma\subseteq\Delta\subseteq\Gamma_0\cup\Delta$. □

Next we state two simple probability bounds. 

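Lemma A.3. Let $Z_1, Z_2, \ldots$ be $\{0,1\}$-valued random variables such that

$$\mathrm{Prob}\{Z_j = 1\mid Z_1,\ldots,Z_{j-1}\} \ge p_* > 0,\qquad j\ge 1,$$

with probability 1. Then for any $0<\nu<1$

$$\mathrm{Prob}\Bigl\{\frac{1}{m}\sum_{j=1}^{m} Z_j < \nu p_*\Bigr\} \le e^{-m(p_*/4)(1-\nu)^2}.$$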

Proof. This is a direct consequence of Lemmas 2 and 3 in the Appendix 
of [7]. □ 

Lemma A.4. Let $Z_1, Z_2, \ldots, Z_n$ be i.i.d. random variables with expectation
0 and variance $D^2$. Then the partial sums

$$S_k = Z_1 + Z_2 + \cdots + Z_k$$

satisfy

$$\mathrm{Prob}\Bigl\{\max_{1\le k\le n} S_k \ge D\sqrt{n}\,(\mu+2)\Bigr\} \le \frac{4}{3}\,\mathrm{Prob}\{S_n \ge D\sqrt{n}\,\mu\};$$

moreover if the random variables are bounded, $|Z_i|\le K$, then

$$\mathrm{Prob}\{S_n \ge D\sqrt{n}\,\mu\} \le 2\exp\Bigl[-\frac{\mu^2}{2\bigl(1+\mu K/(2D\sqrt{n})\bigr)^2}\Bigr],$$

where $\mu\le D\sqrt{n}/K$.

Proof. See, for example, Lemma VI.9.1 and Theorem VI.4.1 in [20]. □

The following three lemmas are of a technical nature. 

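Lemma A.5. For disjoint finite regions $\Phi\subset\mathbb{Z}^d$ and $\Delta\subset\mathbb{Z}^d$, we have

$$Q(a(\Delta)\mid a(\Phi)) \ge q_{\min}^{|\Delta|}.$$

Proof. By induction on $|\Delta|$.
For $\Delta=\{i\}$, $\Xi=\Gamma_0^i\setminus\Phi$, we have

$$Q(a(i)\mid a(\Phi)) = \sum_{a(\Xi)\in A^{\Xi}} Q(a(i)\mid a(\Phi\cup\Xi))\,Q(a(\Xi)\mid a(\Phi))
= \sum_{a(\Xi)\in A^{\Xi}} Q(a(i)\mid a(\Gamma_0^i))\,Q(a(\Xi)\mid a(\Phi)) \ge q_{\min}.$$

Supposing $Q(a(\Delta)\mid a(\Phi))\ge q_{\min}^{|\Delta|}$ holds for some $\Delta$, we have for $\{i\}\cup\Delta$,
with $\Xi=\Gamma_0^i\setminus(\Phi\cup\Delta)$,

$$Q(a(\{i\}\cup\Delta)\mid a(\Phi)) = \sum_{a(\Xi)\in A^{\Xi}} Q(a(\{i\}\cup\Delta\cup\Xi)\mid a(\Phi))
= \sum_{a(\Xi)\in A^{\Xi}} Q(a(i)\mid a(\Delta\cup\Xi\cup\Phi))\,Q(a(\Delta\cup\Xi)\mid a(\Phi)).$$

Since $Q(a(i)\mid a(\Delta\cup\Xi\cup\Phi)) = Q(a(i)\mid a(\Gamma_0^i)) \ge q_{\min}$, we can continue as

$$\ge q_{\min}\,Q(a(\Delta)\mid a(\Phi)) \ge q_{\min}^{|\Delta|+1}.\qquad □$$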

Lemma A.6. The number of all possible blocks appearing in a site and
its neighborhood with radius not exceeding $R$ can be upper bounded as

$$\bigl|\{a(\Gamma,0)\in A^{\Gamma\cup\{0\}} : r(\Gamma)\le R\}\bigr| \le (|A|^2+1)^{(2R+1)^d/2}.$$

Proof. The number of the neighborhoods with cardinality $2m$, $m\ge 1$, and
radius $r(\Gamma)\le R$ is

$$\binom{((2R+1)^d-1)/2}{m}$$

because the neighborhoods are symmetric. Hence, the number in the lemma is

$$|A| + |A|\sum_{m=1}^{((2R+1)^d-1)/2}\binom{((2R+1)^d-1)/2}{m}|A|^{2m}
= |A|\sum_{m=0}^{((2R+1)^d-1)/2}\binom{((2R+1)^d-1)/2}{m}|A|^{2m}\,1^{((2R+1)^d-1)/2-m}.$$

Now, using the binomial theorem, this equals $|A|(|A|^2+1)^{((2R+1)^d-1)/2}$, and
the assertion follows since $|A|\le(|A|^2+1)^{1/2}$. □
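
For small cases the count in Lemma A.6 can be checked by brute force. The Python sketch below enumerates all symmetric neighborhoods inside a max-norm cube, counts the blocks on each neighborhood together with the origin, and compares the total with the bound $(|A|^2+1)^{(2R+1)^d/2}$. The function names `count_blocks` and `lemma_bound`, and the reading of the radius as a maximum-norm radius, are assumptions of the sketch.

```python
import itertools

def count_blocks(radius, dim, alphabet_size):
    """Brute-force count of blocks a(Gamma, 0) over all symmetric
    neighborhoods Gamma (empty one included) of max-norm radius <= radius."""
    sites = [s for s in itertools.product(range(-radius, radius + 1), repeat=dim)
             if any(s)]                       # cube sites other than the origin
    # Keep one representative of each antipodal pair {s, -s}.
    reps = [s for s in sites if s > tuple(-c for c in s)]
    total = 0
    for k in range(len(reps) + 1):
        for chosen in itertools.combinations(reps, k):
            gamma_size = 2 * len(chosen)      # both members of each chosen pair
            total += alphabet_size ** (gamma_size + 1)  # values on Gamma and 0
    return total

def lemma_bound(radius, dim, alphabet_size):
    """Right-hand side of the bound in Lemma A.6."""
    return (alphabet_size ** 2 + 1) ** ((2 * radius + 1) ** dim / 2)

print(count_blocks(1, 2, 2), lemma_bound(1, 2, 2))
```

For a binary alphabet with $d=2$ and $R=1$ there are four antipodal pairs, so the exact count is $2\cdot 5^4$, which lies below the bound $5^{9/2}$.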



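Lemma A.7. Let $P$ and $Q$ be probability distributions on $A$ such that

$$\max_{a\in A}|P(a)-Q(a)| \le \frac{\min_{a\in A}Q(a)}{2}.$$

Then

$$\sum_{a\in A} P(a)\log\frac{P(a)}{Q(a)} \le \frac{1}{\min_{a\in A}Q(a)}\sum_{a\in A}\bigl(P(a)-Q(a)\bigr)^2.$$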

Proof. This follows from Lemma 4 in the Appendix of [7]. □ 
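
The inequality of Lemma A.7 can be sanity-checked numerically on a binary alphabet. The Python sketch below scans a grid of pairs $(P,Q)$ and counts violations. It assumes that the hypothesis reads $\max_a|P(a)-Q(a)|\le\min_a Q(a)/2$ and uses base-2 logarithms, the more demanding choice, since natural logarithms give a smaller left-hand side; the helper names are hypothetical, and a grid scan is of course no substitute for the proof.

```python
import math

def divergence_bits(p, q):
    """Sum of P(a) * log2(P(a) / Q(a)) over the alphabet."""
    return sum(pa * math.log2(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def quadratic_bound(p, q):
    """(1 / min Q) * sum of squared differences, the claimed upper bound."""
    return sum((pa - qa) ** 2 for pa, qa in zip(p, q)) / min(q)

steps = 50
checked = 0
violations = 0
for i in range(1, steps):
    q = (i / steps, 1 - i / steps)
    for j in range(1, steps):
        p = (j / steps, 1 - j / steps)
        # Hypothesis of the lemma, as assumed in this sketch.
        if max(abs(pa - qa) for pa, qa in zip(p, q)) <= min(q) / 2:
            checked += 1
            if divergence_bits(p, q) > quadratic_bound(p, q) + 1e-12:
                violations += 1

print(checked, violations)
```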